Read and process data

Edit extreme values of age

Create age bins for ease of classification

Remove rows with more than one wrong answer in the word-check test: we allow one mistake but not two. Note that it is common for people to fill in surveys carelessly, and online surveys especially are prone to spam.

Remove rows for the fastest and slowest 2.5% of respondents. The dataset description suggests it is normal to spend around 4 to 10 minutes on the survey, and extreme values distort the mean.
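A minimal sketch of this trimming step, on synthetic data; the real completion-time column name (assumed here to be 'testelapse', in seconds) may differ in the actual dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey data.
rng = np.random.default_rng(0)
df = pd.DataFrame({"testelapse": rng.normal(420, 120, 1000)})

# Keep only respondents between the 2.5th and 97.5th percentiles
# of completion time, dropping the fastest and slowest 2.5%.
lo, hi = df["testelapse"].quantile([0.025, 0.975])
trimmed = df[df["testelapse"].between(lo, hi)]
```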

Adding DASS and TIPI scores to the dataset

Change column names to match question items

Note the original dataframe has not been changed because changing column names might create issues with automation. This is useful for the purposes of visuals and analysis only.

Find columns with NaN values and clean up columns with high entropy

In the column 'major' there are nearly 10k NaN entries and around 5k distinct values; in 'country', most respondents are from MY and US, with the remainder spread across the other 143 countries. These variables have high entropy, so we check whether entries can be edited and grouped to make better sense of the data.

In a similar vein, the column for "education" needs to be standardised.

According to the description of the dataset, the values for 'education' should be 1 = Less than high school, 2 = High school, 3 = University degree, 4 = Graduate degree.
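The standardisation can be sketched as a simple code-to-label mapping; the numeric codes come from the dataset description, while treating 0 as missing is an assumption about how invalid entries are encoded:

```python
import pandas as pd

# Codes taken from the dataset description.
EDU_LABELS = {
    1: "Less than high school",
    2: "High school",
    3: "University degree",
    4: "Graduate degree",
}

edu = pd.Series([3, 1, 4, 2, 0])   # 0 assumed to mean missing/invalid
labels = edu.map(EDU_LABELS)       # unmapped codes become NaN
```

Mapping rather than replacing in place leaves the original column intact, consistent with keeping the source dataframe unchanged for automation.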

EDA - which questions are important in predicting scores in each DAS category

Visualise survey results by class

Plot a correlation heatmap for demographic features. There do not appear to be any signs of multicollinearity, aside from some association between age and marital status, which is to be expected.

Plot pie charts to understand the sample population

Notice the sample overwhelmingly comes from 18-24-year-old, female, Malaysian, Muslim, Asian respondents. This raises concerns of sampling bias and suggests the survey may have been used for data collection in a separate controlled study. The sample is certainly not representative, so if we use demographics to predict DAS symptoms the results cannot be generalised.

Another study could explain the high number of Malaysian Muslim entries, as more than 60% are degree-level, unmarried young adults, many of whom study engineering or psychology. Alternatively, spammers could have filled in the survey through a Malaysian VPN; the data does not let us rule this out. However, if we truncate the Malaysian entries, most of the other categories remain proportionally stable.

Plot correlation heatmaps to visualise the data

Emotional stability is negatively correlated with DAS score i.e. emotionally unstable individuals are likely to suffer from more severe symptoms of DAS.

Make correlation plot to find possible association between each question

Now extract the personality test results

About 52% of respondents moderately or strongly agree that they are anxious and easily upset.

The longest amounts of time were spent on question 35, followed by questions 25 and 9.

Score by class, controlled for age

Minority and typically disadvantaged groups are most susceptible to worse DAS symptoms. However, note that the distribution across racial groups especially is very unbalanced, so this result cannot be generalised. Some of the trends shown by these plots are most likely the result of undersampling of certain groups.

Women consistently score higher than men, suggesting the stereotype that men tend to avoid expressing negative feelings may hold. That said, we did not control for variables other than age, so it is difficult to conclude from gender alone that women are more likely to be depressed, anxious or stressed than men.

It's difficult to draw conclusions about the effect of sexual orientation from the plot above; however, we can say that, at least in this study, non-heterosexual people collectively score higher on the DASS survey.

Survey result visualisation

Questions where the largest proportion of respondents (> 30%) reported feeling a certain way most of the time were, respectively:

13 - I felt sad and depressed.

11 - I found myself getting upset rather easily.

17 - I felt I wasn't worth much as a person.

34 - I felt I was pretty worthless.

40 - I was worried about situations in which I might panic and make a fool of myself.

On the contrary, most respondents answered "did not apply to me at all" to these questions:

23 - I had difficulty in swallowing. (63.95%)

15 - I had a feeling of faintness. (49.89%)

19 - I perspired noticeably (e.g. hands sweaty) in the absence of high temperatures or physical exertion. (47.07%)

Therefore most respondents appear to report non-physical symptoms of DAS.

Correlation matrix of category scores against questions

Sorted question items based on correlation score

Visualise correlation of individual questions to each category

Sorted Pearson correlation of question items to each category

Random Forest Modelling

Recursive feature elimination with cross-validation to select optimal number of features
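RFE with cross-validation can be sketched as follows; the toy data and parameter values (50 trees, 3-fold CV) are illustrative stand-ins for the notebook's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Toy data standing in for the question-item features.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=6, random_state=0)

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=1,              # eliminate one feature per iteration
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",
)
rfecv.fit(X, y)
# rfecv.n_features_ is the CV-optimal feature count;
# rfecv.support_ is a boolean mask of the kept columns.
```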

Random forest is an ensemble of n decision trees

We plot decision trees to understand the model better. Note we only plotted a decision tree of max depth 3, adhering to space constraint. Usually random forests consist of 100s or 1000s of decision trees of depth of around 20. The final classification output from the random forest is an ensemble of predictions of all decision trees.

Note that in each iteration of the code the trees generated will be different because the random forest is comprised of a committee of trees like the ones above.
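One way to inspect a single shallow tree from the committee is scikit-learn's text export; this is a sketch on toy data, not the notebook's plotting code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
rf.fit(X, y)

# Dump the first tree of the ensemble as text; with a different
# random_state the forest would contain a different set of trees.
txt = export_text(rf.estimators_[0], max_depth=3)
print(txt)
```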

Predicting the most important questions for the depression category using different models

Training a random forest classifier

The random_state parameter controls the random numbers used to draw the bootstrap samples on which the trees are fitted. Each run would otherwise generate a completely different set of random numbers, giving different bootstrap samples and, in turn, different fitted trees. Setting a random seed controls this stochasticity so the same set of random numbers is replicated every time; random_state is the parameter that sets that seed for the random number generation in a random forest.

The main reason to set a random seed is replicability of the experiment. It is always better to set a seed before building your model, so that every time you build the model on the same data you get exactly the same model.

Setting a random seed is not restricted to random forests: any algorithm that requires random numbers (neural networks, decision trees, etc.) has such a parameter. It does not need to be tuned, however.
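The effect of fixing the seed can be demonstrated directly; this is a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Two forests built with the same seed draw the same bootstrap samples
# and candidate feature subsets, so they produce identical predictions.
rf_a = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
same = np.array_equal(rf_a.predict(X), rf_b.predict(X))
```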

Impurity-based feature importances are often misleading for high cardinality features (e.g. features with many unique values). Permutation feature importance is an alternative model inspection technique and benefits from being model agnostic.
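Permutation importance can be sketched as follows, with toy data standing in for the survey items:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in score;
# unlike impurity importance, this is computed on unseen data and
# works for any fitted estimator.
result = permutation_importance(rf, X_te, y_te, n_repeats=10,
                                random_state=0)
# result.importances_mean holds one mean importance per feature.
```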

At max depth = 9, the model achieves the best accuracy.

Notice the model considers questions 10, 38, 16, 17, 34, 21 and 13 to be most indicative of a high depression score, if we were to use 7 items per subscale as the DASS-21 does. However, in the clinical DASS-21 the chosen items differ from ours; depression is measured with the following items (in order of factor loading from the factor-analysis literature on the DASS-21): 21, 10, 3, 31, 17, 26, 42.

The only overlap with our choice of items is 21, 10 and 17, all questions whose order did not change even in the 21-item version. This may suggest that the survey's original authors intended these to be the most statistically significant items and therefore kept their order of presentation, or that the ordering of the questions does affect the responses given.

In other words, we can select the top 6 questions to predict depression score. The random forest classifier performs at around 71% accuracy in this case. Here we have chosen to select the minimum number of items that will preserve an accuracy above the threshold value of 70%, however the more items we include the better the predictions will be.

We then can use the test set to make predictions and show how our RF classifier model performs.

Show how the RF performs by making a few predictions

We simulate overall depression score based on some unseen, synthetic data.

Classification accuracy is the ratio of correct predictions to total predictions made. The main problem with assessing a model on classification accuracy alone is that it doesn't give the detail needed to diagnose the model's performance.

For example, when the data has more than two classes or an uneven class distribution, a high accuracy score may simply come from predicting the most common class value while one or two classes are neglected entirely.
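The pitfall is easy to reproduce on a synthetic imbalanced label set:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 90 of 100 labels belong to class 0: a trivial model that always
# predicts the majority class scores 90% accuracy while completely
# ignoring class 1.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)
acc = accuracy_score(y_true, y_pred)  # 0.9
```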

Confusion matrix

A confusion matrix is a table often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It is useful for evaluating predictions because it breaks down the number of correct and incorrect predictions by class, giving insight into the types of errors the model makes.

Note each row of the matrix corresponds to a predicted class, each column of the matrix corresponds to an actual class.
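A small sketch of computing the matrix and per-class accuracy; note that scikit-learn's own convention puts true classes on the rows and predicted classes on the columns, so transpose if you prefer the opposite layout:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

# Rows = true classes, columns = predicted classes (sklearn default).
cm = confusion_matrix(y_true, y_pred)

# Per-class recall: correct predictions divided by the row totals.
recall = cm.diagonal() / cm.sum(axis=1)
```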

Predictions for "extremely severe" scores are 89% accurate, however much less so for scores of 1 and 3.

Training a linear regression

A small MSE means the predictions are close to the regression line, i.e. the model has small errors. The model explains 87% of the variation in the outcomes.

The target variable is discrete, but the linear model outputs continuous predictions: the predicted values are the expected values given the current set of predictors. Note the mental distance between "applied to me considerably" and "applied to me very much" may be much greater for some respondents than the difference between "applied to me sometimes" and "applied to me considerably"; ordinal regression takes this into account when encoding the target variable, but ordinal data should not be treated as numerical.

Generate diagnostic plots for linear model

As a consequence of the dependent variable having only a few values, residuals against fitted values are logically spread along parallel lines. Constant variance does not appear to be an issue here, as points bunch around the centre of the residual bands rather than at the ends of the fitted-value bands. We will assume homoscedasticity is not violated.

The residuals have lighter tails than a normal distribution, so the points are not normally distributed; extreme values are less likely than would be expected if the data were normal.

There are signs of non-linearity and heteroscedasticity.

There are no leverage points with Cook's distance greater than 0.5 in this plot and no points that are far out from the rest of the data.

The target variable is best modelled by a normal distribution with mean 2.274 and variance 1.447. Recall that linear regression is typically only advised for continuous dependent variables; it is possible, although not best practice, to model predicted overall depression score as the outcome variable. This is partly because score increases on a scale of 0 to 4, so the intervals between scores are meaningful and interpretable.

However, we encounter problems with the diagnostic plots and normality assumptions because the range of scores is small, i.e. only 5 possible values. Perhaps transforming the target variable with a log transformation is more suitable. Following the transformation we implement a ridge regression model, since linear regression may not be the most suitable for highly intercorrelated predictors. Finally, we try a logistic regression, since the results in the cell above suggest it may be the best fit for the target variable.
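The transform-fit-invert pattern can be sketched as follows; the data is synthetic and log1p/expm1 (rather than a plain log) is an assumption made so that zero scores are handled:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=300, n_features=6, noise=10,
                       random_state=0)
y = y - y.min()  # shift so the target is non-negative before log1p

# Fit ridge regression on log1p(y), then invert with expm1 so the
# predictions can be scored on the original scale.
ridge = Ridge(alpha=1.0).fit(X, np.log1p(y))
y_pred = np.expm1(ridge.predict(X))
r2 = r2_score(y, y_pred)
```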

Log transformation on target variable in the linear model

The pattern does not change with a log transformation.

Log transformation does not give better predictions, as can be seen from a lower R-squared value and a greater median absolute error (MAE).

We deduce that regression is not suitable for the analysis at hand and rather, we should be using classifiers. Ridge regression has a classifier alternative which is faster than logistic regression.

Train a ridge classifier

What about the logistic regression? We see if that yields improved fitted values.

Train a logistic regression

Linear regression model with predictions rounded to match classes
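A sketch of turning the regression output into class predictions; the 0-4 class range matches the severity scores, while the data itself is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression

# Ordinal target with classes 0..4, standing in for severity scores.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=5, random_state=0)

lin = LinearRegression().fit(X, y)
# Round the continuous predictions to the nearest class and clip
# them into the valid 0..4 range.
y_cls = np.clip(np.rint(lin.predict(X)), 0, 4).astype(int)
```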

Predictions generated by the logistic regression mostly overlap with those of the random forest, exhibiting roughly the same level of predictive accuracy. The ridge classifier's predictions tend to differ quite a lot from the other models', suggesting it is the least suitable for this dataset.

Now we replicate the modelling process for the other two mental disorder categories.

Predicting the most important questions for the anxiety category using different models

Maximum depth = 10 gives best accuracy.

It is still possible to select 6 questions while retaining around 71% accuracy on the original dataset. The top 6 questions for predicting anxiety are: 40, 7, 20, 36, 4 and 25.

The questions suggested by the model that overlap with the DASS-21 are: 40, 20, 4 and 25. The questions deemed important by the DASS-21 but not included by the random forest were mainly those describing physical symptoms of anxiety.

The model predicts 88% of "extremely severe" scores correctly and does similarly well for "normal" scores; however, it falls significantly short on predicting overall anxiety scores of 1 and 3.

Finally, predict which questions are most important in the stress category

The top features chosen to predict stress score are questions 11, 27, 1, 6, 29 and 14. The only item both suggested by the model and included in the DASS-21 is item 6, "I tended to over-react to situations", which did not change position between the long and short forms of the survey. However, it ranked 7th by CFA factor loading as opposed to 4th by RF feature importance.

The RF classifier is again accurate at describing the data 68% of the time.

Compared to the DASS-21, the RF chose questions 11, 27, 1, 6, 29 and 14 as its optimal features.

This target variable's distribution is the closest to normal.

For all three mental health disorder categories, the model suggests that including 6 questions per category is advisable, without losing too much of the information encapsulated by the original dataset.